Conversation

@Andy-Jost (Contributor) commented Dec 12, 2025

Summary

This PR introduces C++ RAII-based resource handles for CUDA objects in cuda.core. The handles use std::shared_ptr with custom deleters to ensure proper destruction order and prevent resource leaks.
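
As a rough illustration of the pattern (a minimal sketch only; the type aliases and the create function below are assumptions modeled on the PR's description, not its actual code):

#include <cuda.h>
#include <memory>

// Hypothetical simplification: a StreamHandle whose deleter captures the owning
// ContextHandle, so the context outlives the stream.
using ContextHandle = std::shared_ptr<const CUcontext>;
using StreamHandle  = std::shared_ptr<const CUstream>;

StreamHandle make_stream_handle(ContextHandle h_ctx, unsigned int flags, int priority) {
    CUstream s = nullptr;
    if (cuStreamCreateWithPriority(&s, flags, priority) != CUDA_SUCCESS) {
        return {};  // empty handle signals failure
    }
    // The deleter keeps h_ctx alive until the last copy of this handle is gone.
    return StreamHandle(new CUstream(s), [h_ctx](const CUstream* p) {
        cuStreamDestroy(*p);
        delete p;
    });
}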

For a detailed explanation of the design, see cuda_core/cuda/core/_cpp/DESIGN.md. (Tip: Select ⋯ → View file to see the formatted Markdown.)

Handle Types

  • ContextHandle: Wraps CUcontext with proper retain/release semantics for primary contexts
  • StreamHandle: Wraps CUstream, holds owning reference to its context
  • EventHandle: Wraps CUevent, holds owning reference to its context
  • MemoryPoolHandle: Wraps CUmemoryPool, includes nvbug 5698116 workaround (clears peer access before destruction)
  • DevicePtrHandle: Wraps CUdeviceptr, holds owning reference to memory pool (if applicable), includes IPC pointer cache

Driver Workarounds

  • nvbug 5570902: The IPC pointer cache works around a driver bug where multiple imports of the same device pointer via IPC are not reference-counted correctly.
  • Pinned memory IPC: The cache also enables multiple imports of pinned memory buffers, working around CUDA_ERROR_ALREADY_MAPPED that the driver returns on duplicate imports.

Benefits

  • Eliminates noisy tracebacks from incorrect destruction order (see #1303 comments)
  • Enables safe resource ownership transfer between Python and C++
  • Resources remain valid independent of Python garbage collection timing

Related Issues

Test Plan

  • All existing tests pass
  • Pre-commit checks pass
  • New test for IPC duplicate import (nvbug 5570902)
  • Manual verification of destruction order fixes

copy-pr-bot (bot) commented Dec 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Andy-Jost self-assigned this Dec 12, 2025
- Create _context.pxd with Context class declaration
- Move get_primary_context from _device.pyx to _context.pyx
- Add Cython-level context functions: get_current_context, set_current_context, get_stream_context
- Update _device.pyx to use context module functions
- Update _stream.pyx to use context module functions
- Remove push_context and pop_context (unused, replaced with direct CUDA API calls)
- Reorganize _context.pyx according to style guide (principal class first)
- Replace Context._handle (object) with ContextHandle (shared_ptr) resource handle
- Add handle property to Context returning driver.CUcontext
- Update Context._from_ctx to create ContextHandle using create_context_handle()
- Update Context.__eq__ to compare actual CUcontext values (not shared_ptr addresses)
- Update Context.__hash__ to include type(self) and handle value with NULL safety
- Update _device.pyx to use ctx._handle.get()[0] for direct access
- Update _graph.pyx to use context.handle property
- Update C++ implementation to use default deleter (simplifies code)
- Rename _resource_handles_impl.cpp to _context_impl.cpp
- Remove test_dev.py development script
- Update .gitignore to allow *_impl.cpp files
- Fix all test files to use context.handle instead of context._handle
- switch context helper APIs to return ContextHandle instead of raw CUcontext
- add TLS wrapper for primary context caching using handles
- update device/stream code to consume ContextHandle-based helpers
- expose create_context_handle_ref as nogil-safe in the pxd
Introduce three helper functions for ContextHandle resource extraction:
- native(h): Returns cydriver.CUcontext for use with cydriver API calls
- py(h): Returns driver.CUcontext for use with Python driver API
- intptr(h): Returns uintptr_t for internal APIs expecting integer addresses

These helpers replace direct h_context.get()[0] calls, providing:
- Cleaner, more semantic code
- Consistent extraction pattern across all handle types
- Type-safe conversions with clear intent

Implementation details:
- native() and intptr() are inline nogil functions in .pxd
- py() requires Python module access, implemented in new _resource_handles.pyx
- Updated all call sites in _context, _device, and _stream modules
Move get_primary_context/get_current_context into C++ with thread-local caching and conditional GIL release; inline create_context_handle_ref in the header; update Cython modules and build hooks to link handle users (including _device) against resource_handles and libcuda.
The C++ implementation in _resource_handles_impl.cpp is now compiled
only into _resource_handles.so. Other modules that depend on these
symbols (_context, _device, etc.) resolve them at runtime via the
global symbol table.

This ensures a single shared instance of thread-local caches and
avoids setuptools issues with shared source files across extensions.
Move native(), intptr(), and py() from Cython inline functions to
inline C++ functions in resource_handles.hpp. This enables function
overloading when additional handle types (e.g., StreamHandle) are added.

- native(): extract raw CUDA handle from ContextHandle
- intptr(): extract handle as uintptr_t for Python interop
- py(): convert handle to Python driver wrapper object
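
For illustration, a minimal sketch of what the overloaded accessors can look like in the header (assuming ContextHandle = std::shared_ptr<const CUcontext>; the actual header in this PR may differ, and the functions were later renamed to as_cu/as_py/as_intptr):

#include <cstdint>

// Hypothetical inline accessors; one overload set is added per handle type.
inline CUcontext native(const ContextHandle& h) noexcept {
    return h ? *h : nullptr;                            // empty handle maps to a null context
}
inline std::uintptr_t intptr(const ContextHandle& h) noexcept {
    return reinterpret_cast<std::uintptr_t>(native(h));
}
// py() builds the Python driver wrapper object, so it needs the GIL and is omitted here.
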
Add StreamHandle for automatic stream lifetime management using the same
shared_ptr-based pattern established for ContextHandle.

Changes:
- Add StreamHandle type and create_stream_handle/create_stream_handle_ref
  functions in C++ with implementations in _resource_handles_impl.cpp
- Add overloaded native(), intptr(), py() helpers for StreamHandle
- Update Stream class to use _h_stream (StreamHandle) instead of raw _handle
- Owned streams are automatically destroyed when last reference is released
- Borrowed streams (from __cuda_stream__ protocol) hold _owner reference
- Update memory resource files to use native(stream._h_stream)
- Simplify Context using intptr() and py() helpers
- Move stream creation to C++ (create_stream_handle now calls
  cuStreamCreateWithPriority internally)
- Add get_legacy_stream/get_per_thread_stream for built-in streams
- Add create_stream_handle_with_owner for borrowed streams that
  prevents Python owner from being GC'd via captured PyObject*
- Add GILAcquireGuard (symmetric to GILReleaseGuard) for safely
  acquiring GIL in C++ destructors
- Simplify Stream class: remove __cinit__, _owner, _builtin,
  _legacy_default, _per_thread_default
- Use _from_handle as single initialization point for Stream
- Remove obsolete subclassing tests for removed methods
- Replace raw CUcontext _ctx_handle with ContextHandle _h_context
  for consistent handle paradigm and cleaner code
- Replace CUdevice _device_id with int using -1 sentinel
- Use intptr() helper instead of <uintptr_t>() casts throughout
- Add _from_handle(type cls, ...) factory with subclass support
- Add _legacy_default and _per_thread_default classmethods
- Eliminate duplicated initialization code in _init
- Event now uses ContextHandle for _h_context instead of raw object
- Event._init is now a cdef staticmethod accepting ContextHandle
- Context._from_ctx renamed to Context._from_handle (cdef staticmethod)
- Moved get_device_from_ctx to Stream module as Stream_ensure_ctx_device
- Inlined get_stream_context into Stream_ensure_ctx
- Simplified context push/pop logic in Stream_ensure_ctx_device

Naming standardization:
- Device._id -> Device._device_id
- _dev_id -> _device_id throughout codebase
- dev_id -> device_id for local variables
- Updated tests to use public APIs instead of internal _init methods
Device now stores its Context in _context slot, set during set_current().
This ensures Device holds an owning reference to its context, enabling
proper lifetime management when passed to Stream and Event creation.

Changes:
- Add _context to Device.__slots__
- Store Context in set_current() for both primary and explicit context paths
- Simplify context property to return stored _context
- Update create_event() to use self._context._h_context
- Remove get_current_context import (no longer needed in _device.pyx)

Add structural context dependency to owned streams

StreamBox now holds ContextHandle to ensure context outlives the stream.
This structural dependency is only for owned streams - borrowed streams
delegate context lifetime management to their owners.

C++ changes:
- StreamBox gains h_context member
- create_stream_handle(h_ctx, flags, priority) takes owning context
- create_stream_handle_ref(stream) - caller manages context
- create_stream_handle_with_owner(stream, owner) - Python owner manages context

Cython/Python changes:
- Stream._init() accepts optional ctx parameter
- Device.create_stream() passes self._context to Stream._init()
- Owned streams get context handle embedded in C++ handle
Event now uses EventHandle (shared_ptr) for RAII-based lifetime management,
following the same pattern as Stream.

C++ changes:
- Add EventHandle type alias and EventBox struct
- Add create_event_handle(h_ctx, flags) with context captured in deleter
- Add create_event_handle_ipc(ipc_handle) for IPC events (no context dep)
- Add native(), intptr(), py() overloads for EventHandle

Cython changes:
- Event._h_event replaces raw CUevent _handle
- _init() uses create_event_handle()
- from_ipc_descriptor() uses create_event_handle_ipc()
- close() uses _h_event.reset()
- Keep _h_context for cached fast access
- Simplified branch structure: early return for Event, single path for Stream
- Use native() helper for handle access instead of casting via handle property
- Temporary events now use EventHandle with RAII cleanup (no explicit cuEventDestroy)
- Added create_event_handle import
- New overload takes only flags (no ContextHandle) for temporary events
- Delegates to existing overload with empty ContextHandle
- Updated _stream.pyx and _memoryview.pyx to use simpler overload
- Removed unnecessary get_current_context import from _memoryview.pyx
- Removed unnecessary Stream_ensure_ctx call from Stream.wait()
C++ layer (resource_handles.hpp/cpp):
- Add MemoryPoolHandle = std::shared_ptr<const CUmemoryPool>
- Add create_mempool_handle(props) - owning, calls cuMemPoolDestroy on release
- Add create_mempool_handle_ref(pool) - non-owning reference
- Add create_mempool_handle_ipc(fd, handle_type) - owning from IPC import
- Add get_device_mempool(device_id) - get current pool for device (non-owning)
- Add native(), intptr(), py() overloads for MemoryPoolHandle

Cython layer:
- Update _resource_handles.pxd with new types and functions
- Update _device_memory_resource.pxd: replace raw handle with MemoryPoolHandle
- Reorder members: _h_pool first (matches Stream/Event pattern)
- Update _device_memory_resource.pyx to use new handle functions
- Update _ipc.pyx to use create_mempool_handle_ipc for IPC imports
- DMR_close now uses RAII (_h_pool.reset()) instead of explicit cuMemPoolDestroy
- Consistent member initialization order across __cinit__, init functions, and close
Introduce DevicePtrHandle (std::shared_ptr<const CUdeviceptr>) to manage
device pointer lifetimes with automatic deallocation. Key features:

- Allocation functions: deviceptr_alloc_from_pool, deviceptr_alloc_async,
  deviceptr_alloc, deviceptr_alloc_host, deviceptr_create_ref
- IPC import via deviceptr_import_ipc with error output parameter
- Deallocation stream stored in mutable DevicePtrBox, accessible via
  deallocation_stream() and set_deallocation_stream()
- cuMemFreeAsync used for deallocation (NULL stream = legacy default)
- Buffer class updated to use DevicePtrHandle instead of raw pointers
- Buffer.handle returns integer for backward compatibility with ctypes
- IPCBufferDescriptor.payload_ptr() helper to simplify casting

Note: IPC-imported pointers do not yet implement reference counting
workaround for nvbug 5570902.
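
A minimal sketch of the idea behind these allocation handles (an assumed simplification: the real code stores the deallocation stream in a mutable DevicePtrBox so it can be changed later, while this sketch captures it once in the deleter):

#include <cuda.h>
#include <cstddef>
#include <memory>

using DevicePtrHandle = std::shared_ptr<const CUdeviceptr>;

// Hypothetical: allocate from a pool on a stream, free asynchronously on release.
DevicePtrHandle deviceptr_alloc_from_pool_sketch(std::size_t size, CUmemoryPool pool, CUstream stream) {
    CUdeviceptr ptr = 0;
    if (cuMemAllocFromPoolAsync(&ptr, size, pool, stream) != CUDA_SUCCESS) {
        return {};
    }
    return DevicePtrHandle(new CUdeviceptr(ptr), [stream](const CUdeviceptr* p) {
        cuMemFreeAsync(*p, stream);  // a NULL stream means the legacy default stream
        delete p;
    });
}
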
Change all intptr() overloads to return std::intptr_t (signed) instead
of std::uintptr_t per C standard convention for pointer-to-integer
conversion. This addresses issue NVIDIA#1342 which requires Buffer.handle
to return a signed integer.

Fixes NVIDIA#1342
Implement a systematic error handling approach for C++ resource handle
functions using thread-local storage, similar to cudaGetLastError().

API:
- get_last_error(): Returns and clears the last CUDA error
- peek_last_error(): Returns without clearing
- clear_last_error(): Explicitly clears the error

All functions that can fail now set the thread-local error before
returning an empty handle. This allows callers to retrieve specific
CUDA error codes for proper exception propagation.

Updated deviceptr_import_ipc to use this pattern instead of an
output parameter.
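
Roughly, caller-side usage looks like this (a sketch; create_stream_handle and its arguments follow the signatures mentioned elsewhere in this PR, and the exception translation is left to the Cython layer):

// Hypothetical caller: an empty handle signals failure; fetch the precise error.
void example_create(ContextHandle h_ctx) {
    StreamHandle h = create_stream_handle(h_ctx, CU_STREAM_NON_BLOCKING, /*priority=*/0);
    if (!h) {
        CUresult err = get_last_error();  // returns and clears the thread-local status
        // ... raise the corresponding Python exception from err ...
    }
}
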
IPC-imported device pointers are not correctly reference counted by the
driver - the first cuMemFreeAsync incorrectly unmaps the memory even when
the pointer was imported multiple times.

Work around this by caching imported pointers and returning the same
handle for duplicate imports. The cache uses weak_ptr so entries are
automatically cleaned up when all references are released.

The workaround can be easily bypassed via use_ipc_ptr_cache() when a
driver fix becomes available.
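
Conceptually, the cache is something like the following (a minimal sketch with assumed names; it keys on the imported pointer value and holds weak_ptr entries so the cache never extends a pointer's lifetime):

#include <cuda.h>
#include <memory>
#include <mutex>
#include <unordered_map>

using DevicePtrHandle = std::shared_ptr<const CUdeviceptr>;

std::mutex ipc_cache_mutex;
std::unordered_map<CUdeviceptr, std::weak_ptr<const CUdeviceptr>> ipc_cache;

// Hypothetical import path: duplicate imports hand back the one live handle.
DevicePtrHandle import_ipc_cached(CUmemoryPool pool, CUmemPoolPtrExportData* data, CUstream stream) {
    CUdeviceptr ptr = 0;
    if (cuMemPoolImportPointer(&ptr, pool, data) != CUDA_SUCCESS)
        return {};
    std::lock_guard<std::mutex> lock(ipc_cache_mutex);
    if (auto existing = ipc_cache[ptr].lock())
        return existing;                     // reuse: only one owning handle per pointer
    DevicePtrHandle owned(new CUdeviceptr(ptr), [stream](const CUdeviceptr* p) {
        cuMemFreeAsync(*p, stream);          // single tracked release frees the import
        delete p;
    });
    ipc_cache[ptr] = owned;                  // weak_ptr entry expires with the last reference
    // (a real cache can also prune expired entries; omitted here)
    return owned;
}
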
Implements handle-based owner tracking for device pointers, consistent
with the pattern used for streams (create_stream_handle_with_owner).

Changes:
- Add deviceptr_create_with_owner() - creates non-owning handle that
  keeps a Python owner alive via Py_INCREF/DECREF (lambda capture)
- If owner is nullptr, delegates to deviceptr_create_ref
- Buffer._owner field tracks owner in Python for property access
- Buffer._init() simplified to always call deviceptr_create_with_owner
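
A minimal sketch of the keep-alive idea (an assumed simplification; the GIL handling around the Py_DECREF, which the real code needs, is reduced to a comment here):

#include <Python.h>
#include <cuda.h>
#include <memory>

using DevicePtrHandle = std::shared_ptr<const CUdeviceptr>;

// Hypothetical: non-owning handle that keeps a Python owner alive for its lifetime.
DevicePtrHandle deviceptr_create_with_owner_sketch(CUdeviceptr ptr, PyObject* owner) {
    if (owner == nullptr) {
        return DevicePtrHandle(new CUdeviceptr(ptr));  // plain non-owning reference
    }
    Py_INCREF(owner);
    return DevicePtrHandle(new CUdeviceptr(ptr), [owner](const CUdeviceptr* p) {
        delete p;
        Py_DECREF(owner);  // the real deleter must hold the GIL here (GILAcquireGuard)
    });
}
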
leofang added this to the cuda.core beta 11 milestone Dec 15, 2025
@Andy-Jost (Contributor Author):
/ok to test b629ec6

Expose a full C++ handles function table via PyCapsule so extensions can dispatch without RTLD_GLOBAL, and switch resource_handles.cpp to load libcuda symbols at runtime to support CPU-only imports.
Use a lazy PyCapsule in _resource_handles to resolve and cache required CUDA driver entrypoints via cuda.bindings.driver.cuGetProcAddress, and have resource_handles.cpp consume that table on first use. This avoids duplicating driver pathfinding logic and removes dlopen/dlsym linkage requirements.
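
For context, the capsule-based dispatch works roughly like this (a generic sketch of the technique; the struct layout and the dotted capsule name below are assumptions, not the PR's actual definitions):

#include <Python.h>

// Hypothetical function table exported by the module that compiles the C++ code.
// The exporting module publishes it with PyCapsule_New(&table, <name>, nullptr).
struct HandlesApi {
    int (*example_entry)(void);
    // ... one pointer per exported handle function ...
};

const HandlesApi* api_table = nullptr;  // consumers resolve this before first use

int init_handles_table() {
    api_table = static_cast<const HandlesApi*>(
        PyCapsule_Import("cuda.core.experimental._resource_handles._CXX_API", 0));
    return api_table != nullptr ? 0 : -1;  // on failure a Python error is already set
}
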
@rwgk (Collaborator) commented Jan 7, 2026

The latest update:

  1. Refactors py methods to reduce duplicate code
  2. Renames native to cu

On the first point, I was able to get a thread-safe solution without explicit use of atomics. For instance:

static PyObject* cls = detail::py_class_for("CUcontext");

C++ ensures this initialization will occur at most once.

AFAICT this is probably fine for free-threaded Python, but it has the exact same latent deadlock risk in multi-threaded scenarios that we had to solve with pybind/pybind11#4877.

I'm fine if you don't want to add the robust code in this PR, but then we should create an issue to track the latent deadlock risk. We can work on that as a follow-on.


More background:

The latent deadlock risk is explained in detail here: https://github.com/pybind/pybind11/blob/master/docs/advanced/deadlock.md

The deadlock scenario:

  1. Thread A holds GIL, enters static init, acquires C++ static init lock
  2. Some Python C API function releases the GIL during execution
  3. Thread B acquires GIL, enters same function, blocks on C++ static init lock
  4. Thread A tries to reacquire GIL → deadlock

It will most likely work most of the time in practice: the module is usually already imported, and the attribute access typically doesn't release the GIL. But based on my pybind11 experience, I'm convinced we're setting ourselves up for flakiness with this code. The pybind11 pattern "worked" for years until numpy 2 came around, and then suddenly became very flaky for reasons I still don't fully understand.

It seems easily avoidable. LLMs can generate a solution that avoids the latent deadlock risk in seconds (e.g., release GIL → std::call_once → reacquire GIL inside), and those few seconds plus ~20 lines of extra code are likely to save hours of dealing with flakiness in the future.
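
For concreteness, the ordering rwgk describes might look like this (a sketch, assuming the caller holds the GIL and the one-time initializer may call back into Python):

#include <Python.h>
#include <mutex>

std::once_flag init_once;

void ensure_initialized() {
    // Release the GIL first so the lock order is always: call_once guard, then GIL.
    PyThreadState* tstate = PyEval_SaveThread();
    std::call_once(init_once, [] {
        PyGILState_STATE g = PyGILState_Ensure();  // reacquire before touching Python
        // ... one-time initialization that may use the Python C API ...
        PyGILState_Release(g);
    });
    PyEval_RestoreThread(tstate);
}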

@rwgk (Collaborator) commented Jan 7, 2026

Oh, sorry I missed this comment yesterday:

An alternative might be "Holder," as in StreamHolder, which also preserves the convenient h_stream naming pattern.

I really like Holder; based on what @cpcloud said, it's better than Rc.

I lean toward cu, py, and intptr over the as_* variants since these accessors appear often in nested expressions. In this case, I think the readability benefit of having shorter lines and fewer line wraps outweighs the loss of descriptiveness since handles and their accessors are used so widely that most developers will quickly become familiar with them.

I want to throw in a couple extra arguments:

  • I have overlooked such short names completely many more times than I'd like to admit. "Too short to even be read"
  • It's much more difficult to pin-point cu or py in a large source code file than as_cu or as_py.

// we avoid calling PyEval_SaveThread (which requires holding the GIL).
// It also handles the case where Python is finalizing and GIL operations
// are no longer safe.
class GILReleaseGuard {
Review comment (Collaborator):

Minor suggestion: Add a Note documenting the behavior when the GIL isn't actually released:

// Note: If the GIL is not released (finalizing, or not held):
// - Reduces parallelism (other Python threads remain blocked)
// - No deadlock risk as long as the guarded code doesn't call back into Python
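
For readers unfamiliar with the guard, a minimal sketch of its shape (hypothetical; the actual guard in this PR also covers the finalization case mentioned above):

#include <Python.h>

// Hypothetical RAII guard: releases the GIL only if this thread currently holds it.
class GILReleaseGuardSketch {
public:
    GILReleaseGuardSketch() {
        if (PyGILState_Check()) {
            tstate_ = PyEval_SaveThread();
        }
    }
    ~GILReleaseGuardSketch() {
        if (tstate_ != nullptr) {
            PyEval_RestoreThread(tstate_);
        }
    }
private:
    PyThreadState* tstate_ = nullptr;
};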

@rwgk (Collaborator) left a review comment:

Some additional pretty minor suggestions. Technically this PR looks great to me. My main concern is intuitive naming: I believe Holder is far better than Handle, for the reasons explained under the heading '"Handle" implies' in my comment from a couple of days ago.

// Stream Handles
// ============================================================================

struct StreamBox {
Review comment (Collaborator):

The *Box structs (ContextBox, StreamBox, EventBox, MemoryPoolBox, DevicePtrBox) seem to be pure implementation details used only within this .cpp file. I'd move them into the existing anonymous namespace to provide stronger encapsulation.

};

static DevicePtrBox* get_box(const DevicePtrHandle& h) {
    const CUdeviceptr* p = h.get();
Review comment (Collaborator):

Suggested comment (I'd put it inside the function):

// Recovers the owning DevicePtrBox from the aliased CUdeviceptr pointer.
// This works because DevicePtrHandle is a shared_ptr alias pointing to
// &box->resource, so we can compute the containing struct using offsetof.
// The const_cast is safe because we only use this to access the mutable
// h_stream member or in the deleter (where the box is being destroyed).
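
To make the recovery concrete, a sketch of the function being described (assuming resource is the first member of DevicePtrBox and the handle aliases &box->resource):

#include <cstddef>

// Hypothetical sketch of pointer recovery via the shared_ptr aliasing constructor.
static DevicePtrBox* get_box_sketch(const DevicePtrHandle& h) {
    const CUdeviceptr* p = h.get();  // points at box->resource
    const char* bytes = reinterpret_cast<const char*>(p) - offsetof(DevicePtrBox, resource);
    return const_cast<DevicePtrBox*>(reinterpret_cast<const DevicePtrBox*>(bytes));
}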


struct DevicePtrBox {
    CUdeviceptr resource;
    mutable StreamHandle h_stream;
Review comment (Collaborator):

Suggested comment for h_stream:

    // Mutable to allow set_deallocation_stream() to update the stream
    // through a const DevicePtrHandle. The stream can be changed after
    // allocation (e.g., to synchronize deallocation with a different stream).

Comment on lines 24 to 25
static std::once_flag driver_load_once;
static bool driver_loaded = false;
Review comment (Collaborator):

I believe static is redundant inside an anonymous namespace. Harmless but I'd remove it.

I see there are also static declarations scattered elsewhere in the file, for example, static DevicePtrBox* get_box(...) and static void clear_mempool_peer_access(...) in namespace cuda_core. These helper functions (and the *Box structs they work with) are implementation details that don't need external visibility. Moving them all into the anonymous namespace would:

  1. Group all implementation details in one place
  2. Eliminate the need for scattered static declarations
  3. Make the public API surface of namespace cuda_core clearer

Reply (Contributor Author):

I didn't go so far as to group everything into anonymous namespaces, but I cleaned up and grouped shared things as much as possible. I kept the structure where each resource type has a section with its Box declaration, helper classes and functions, and public API functions.

cu,
)

_init_handles_table()
Review comment (Collaborator):

I think it'd be good to consistently highlight these calls (also in other files) with a comment, so they stand out to readers glancing through, and to refer back to the documentation, e.g.:

# prerequisite before calling handle API functions (see _cpp/Design.md)
_init_handles_table()
<empty line here>

- Refactor py() overloads to use detail::make_py() helper, eliminating
  code duplication
- Remove static caching of Python class objects in py() to avoid
  deadlock between static initialization guard and GIL
- Fix ensure_driver_loaded() by releasing GIL before call_once to
  ensure consistent lock order (guard -> GIL)
- Add "Static Initialization and Deadlock Hazards" section to DESIGN.md
  with link to pybind11 documentation
@Andy-Jost (Contributor Author):

AFAICT this is probably fine for free-threaded Python, but it has the exact same latent deadlock risk in multi-threaded scenarios that we had to solve with pybind/pybind11#4877.

Thanks for pointing this out. This is a pretty subtle issue that could cause a headache later. I dealt with this in the latest upload as follows:

  • Removed the static class lookup in py functions. The deadlock risk is real and I don't think the extra complexity is justified for these functions. They are just nice-to-haves, not critical to performance.
  • I found an additional instance of this problem related to load_driver_api and updated it to release the GIL before entering std::call_once.
  • I updated the DESIGN.md document to point out the deadlock hazard.

As for the naming, let's put it up to a vote. I have a slight preference for the existing names, but I don't mind changing the code if others prefer the new names. I mainly want to avoid changing multiple times. There are two proposed name changes:

  1. cu/py/intptr -> as_cu/as_py/as_intptr
  2. StreamHandle -> StreamHolder etc.

The second one also involves renaming files, documentation, etc. If anyone has a preference either way, please mention it.

I will get to work on the remaining comments now.

- Add note to GILReleaseGuard about behavior when GIL isn't released
- Add comment for mutable h_stream in DevicePtrBox
- Add comment to get_box() explaining pointer recovery technique
- Add _init_handles_table() comments to all .pyx files referencing DESIGN.md
- Remove redundant static from items inside anonymous namespace
- Wrap *Box structs in anonymous namespace for encapsulation
@Andy-Jost (Contributor Author):

/ok to test 61326e9

@rwgk (Collaborator) commented Jan 7, 2026

Logging this here FYI / for visibility.


Question

Do you see any danger of reference cycles between the new family of shared_ptr types (StreamHandle, ContextHandle, etc.)?

Analysis

The dependency graph among handle types is:

ContextHandle (no handle dependencies)
    ↑
    ├── StreamHandle (deleter captures ContextHandle)
    │       ↑
    │       └── DevicePtrHandle (box contains StreamHandle, deleter may capture MemoryPoolHandle)
    │
    └── EventHandle (deleter captures ContextHandle)

MemoryPoolHandle (no handle dependencies)

No cycles are possible among the C++ shared_ptr types. The dependencies form a DAG — all arrows point "upward" toward ContextHandle or are one-directional.

Key observations:

  • StreamHandle → ContextHandle, but ContextHandle never references StreamHandle
  • DevicePtrHandle → StreamHandle + MemoryPoolHandle, but neither of those references DevicePtrHandle
  • StreamBox and EventBox only contain the raw CUDA handle, no other *Handle types

One Potential Concern: Python ↔ C++ Cycles

The _with_owner variants (create_stream_handle_with_owner, deviceptr_create_with_owner) capture a PyObject* via Py_INCREF. If user code creates a situation where:

  1. Python object A holds a handle
  2. That handle's deleter holds a ref to Python object B
  3. B holds a reference to A

...this would be a cycle that Python's GC cannot break (because the C++ Py_INCREF is invisible to the GC's cycle detector).

However, this is a general issue with C extensions holding Python references, not specific to this design. The _with_owner pattern is intended for borrowing handles from foreign objects, where you wouldn't typically store the resulting handle back on the owner.

Conclusion

The C++ handle graph is cycle-free by design. Python-level cycles involving _with_owner are theoretically possible but unlikely in practice.

@leofang (Member) left a review comment:

Sorry for jumping in late, would it be possible to give me a bit more time to review?

@Andy-Jost (Contributor Author):

No cycles are possible among the C++ shared_ptr types.

Yes, the natural ownership relation implied by the driver should preserve this forever. The driver would need to introduce resources that mutually own one another to break this.

# SPDX-License-Identifier: Apache-2.0

recursive-include cuda/core *.pyx *.pxd
recursive-include cuda/core *.pyx *.pxd *.cpp *.hpp
Review comment (Member):

Q: Would this include Cython-generated code? If so we don't want to unconditionally include .cpp files.

cdef:
    cydriver.CUevent _handle
    EventHandle _h_event
    ContextHandle _h_context  # Cached for fast access
Review comment (Member):

nit: the comment isn't quite accurate; the point of caching the context handle is not faster access, but that there is no driver API to look it up (unlike cuStreamGetCtx)

Review comment (Member):

Thanks for writing up the markdown file, Andy. But the added complexity, especially around the new Cython bindings (to the resource handle C++ code), and the GIL handling, warrants at least another discussion, I think. This complexity was not mentioned in the original design docs (or maybe I missed it?), and I am not sure it is the best way to go. I'll schedule a meeting to follow up.

Reply (Contributor Author):

As discussed in the meeting, I'll follow up to see whether the capsules can be eliminated. See #1450 and #1452

// ============================================================================

// cu() - extract the raw CUDA handle
inline CUcontext cu(const ContextHandle& h) noexcept {
Review comment (Member):

would be better to avoid two-letter names like cu and py

Reply (Contributor Author):

As a general rule, yes, short names should be avoided. In this case, cu() and py() are ubiquitous operations used throughout the codebase, and longer names would arguably harm readability.

These functions are designed to be nested in expressions, so brevity matters.

For example:

HANDLE_RETURN(cydriver.cuMemcpyAsync(
   cu(dst._h_ptr), cu(self._h_ptr), src_size, cu(s._h_stream)))

With a longer name like get_cuda_handle():

HANDLE_RETURN(cydriver.cuMemcpyAsync(
    get_cuda_handle(dst._h_ptr), get_cuda_handle(self._h_ptr),
    src_size, get_cuda_handle(s._h_stream)))

The longer form requires more lines, more visual parsing, and potentially more temporary variables. Given how frequently these appear in driver call expressions, I believe the short names reduce cognitive load rather than increase it.

Reply (Collaborator):

cu is too short and get_cuda_handle is too long:

  • too short: it's a pain to pin-point such names while reading the code. Searching for cu will have many false positives. cu( will be better, but still. I think even LLMs will have trouble pin-pointing all calls with a short name.

  • too long: cognitive load

The sweet spot is somewhere in the middle.

Reply (Contributor Author):

If we're trying to push this PR through today, I suggest we table this question and follow up, if needed.

Comment on lines +287 to +306
// ============================================================================
// Thread-local error handling
// ============================================================================

// Thread-local status of the most recent CUDA API call in this module.
static thread_local CUresult err = CUDA_SUCCESS;

CUresult get_last_error() noexcept {
    CUresult e = err;
    err = CUDA_SUCCESS;
    return e;
}

CUresult peek_last_error() noexcept {
    return err;
}

void clear_last_error() noexcept {
    err = CUDA_SUCCESS;
}
Review comment (Member):

The cudart-style error handling seems a lot harder to use than the previous HANDLE_RETURN... What is the motivation for introducing a thread-local state?

Reply (Contributor Author):

I think this is another question we should follow up on, if needed.

@Andy-Jost (Contributor Author):

Related issue for future simplification: #1450 (investigate using __pyx_capi__ to potentially eliminate the PyCapsule mechanisms)

@Andy-Jost (Contributor Author):

Related issue: #1452 (investigate eliminating _CXX_API capsule by having Cython modules call directly into resource handles code)

- MANIFEST.in: Be specific about C++ files to avoid Cython-generated .cpp
- _event.pxd: Remove inaccurate comment about caching for fast access
- Remove unnecessary experimental/_context.pxd stub
@Andy-Jost (Contributor Author):

/ok to test b3c6c64

@Andy-Jost (Contributor Author):

That latest upload fixes issues from Leo's comments. Remaining open questions:

If we can resolve or defer these, then this PR is ready for merge AFAIK.

@rwgk (Collaborator) commented Jan 9, 2026

That latest upload fixes issues from Leo's comments. Remaining open questions:

I think the question of the two-letter names cu and py should be settled in this PR:

  • Super easy change now
  • But once it's "out there" I wouldn't want to touch it anymore.

My preference would still be to use as_cu, as_py, and for symmetry, as_intptr.

Rename cu() -> as_cu(), py() -> as_py(), intptr() -> as_intptr()
to avoid overly short two-letter function names per review feedback.
@Andy-Jost (Contributor Author):

/ok to test 10623ee

@Andy-Jost (Contributor Author):

The latest upload mass-renames py, cu, and intptr to their as_-prefixed forms.

@leofang (Member) left a review comment:

Thanks, Andy/Ralf!

leofang merged commit acc78f7 into NVIDIA:main Jan 10, 2026
80 checks passed
@github-actions (bot)

Doc Preview CI
Preview removed because the pull request was closed or merged.

Labels

cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), P0 (High priority - Must do!)

Development

Successfully merging this pull request may close these issues.

  • Buffer.handle should be intptr_t
  • Implement a handle-based resource management layer

5 participants